Width Provably Matters in Optimization for Deep Linear Neural Networks
We prove that for an $L$-layer fully-connected linear neural network, if the
width of every hidden layer is $\tilde{\Omega}(L \cdot r \cdot d_{\mathrm{out}} \cdot \kappa^3)$, where $r$ and $\kappa$ are the rank and the condition number
of the input data and $d_{\mathrm{out}}$ is the output dimension, then
gradient descent with Gaussian random initialization converges to a global
minimum at a linear rate. The number of iterations to find an
$\epsilon$-suboptimal solution is $O(\kappa \log(1/\epsilon))$. Our
polynomial upper bound on the total running time for wide deep linear networks
and the $\exp(\Omega(L))$ lower bound for narrow deep
linear neural networks [Shamir, 2018] together demonstrate that wide layers are
necessary for optimizing deep models.
Comment: In ICML 2019
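As a concrete illustration of the setting (not the paper's proof or its exact width and rate requirements), the sketch below trains a fully-connected linear network with gradient descent from Gaussian random initialization on a synthetic regression task; the dimensions, initialization scale, and learning rate are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear regression target: Y = W_star @ X.
d_in, d_out, n, L, m = 10, 5, 100, 3, 50      # m: hidden-layer width
X = rng.standard_normal((d_in, n))
Y = rng.standard_normal((d_out, d_in)) @ X

# L-layer linear network, Gaussian init scaled by 1/sqrt(fan-in).
dims = [d_in] + [m] * (L - 1) + [d_out]
Ws = [rng.standard_normal((dims[i + 1], dims[i])) / np.sqrt(dims[i]) for i in range(L)]

lr = 0.01                                      # hand-tuned for this toy problem
for t in range(2001):
    acts = [X]                                 # forward pass, keeping activations
    for W in Ws:
        acts.append(W @ acts[-1])
    loss = 0.5 * np.sum((acts[-1] - Y) ** 2) / n
    if t % 500 == 0:
        print(t, loss)
    G = (acts[-1] - Y) / n                     # backprop the scaled residual
    grads = []
    for i in reversed(range(L)):
        grads.append(G @ acts[i].T)            # gradient w.r.t. Ws[i]
        G = Ws[i].T @ G
    for W, g in zip(Ws, reversed(grads)):
        W -= lr * g                            # in-place gradient step
```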
Efficient Nonparametric Smoothness Estimation
Sobolev quantities (norms, inner products, and distances) of probability
density functions are important in the theory of nonparametric statistics, but
have rarely been used in practice, partly due to a lack of practical
estimators. They also include, as special cases, $L^2$ quantities which are
used in many applications. We propose and analyze a family of estimators for
Sobolev quantities of unknown probability density functions. We bound the bias
and variance of our estimators over finite samples, finding that they are
generally minimax rate-optimal. Our estimators are significantly more
computationally tractable than previous estimators, and exhibit a
statistical/computational trade-off allowing them to adapt to computational
constraints. We also draw theoretical connections to recent work on fast
two-sample testing. Finally, we empirically validate our estimators on
synthetic data.
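To make the estimand concrete, here is a minimal 1-D sketch of one standard route to such estimators: truncate the Sobolev quantity in the Fourier domain and plug in the unbiased U-statistic for the squared characteristic function. The function name, the cutoff z_max, and the toy density are illustrative, not the paper's construction.

```python
import numpy as np

def sobolev_estimate(x, s=1, z_max=5):
    """Estimate sum_{0 < |z| <= z_max} |z|^(2s) * |phi(z)|^2 for a density on
    [0, 2*pi), where phi(z) = E[exp(i*z*X)]. Uses the unbiased U-statistic
    (|sum_j e_j|^2 - n) / (n*(n-1)) for |phi(z)|^2."""
    n = len(x)
    total = 0.0
    for z in range(1, z_max + 1):
        e_sum = np.exp(1j * z * x).sum()
        phi2 = (abs(e_sum) ** 2 - n) / (n * (n - 1))
        total += 2 * z ** (2 * s) * phi2          # factor 2: terms +z and -z
    return total

# Toy density p(t) = (1 + cos t) / (2*pi): only |phi(1)|^2 = 1/4 is nonzero,
# so the true value of the estimand is 2 * 1/4 = 0.5.
rng = np.random.default_rng(0)
samples = []
while len(samples) < 10000:                       # simple rejection sampler
    t = rng.uniform(0, 2 * np.pi)
    if rng.uniform(0, 2) < 1 + np.cos(t):
        samples.append(t)
print(sobolev_estimate(np.array(samples)))        # should be close to 0.5
```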
Fast and Sample Efficient Inductive Matrix Completion via Multi-Phase Procrustes Flow
We revisit the inductive matrix completion problem that aims to recover a
rank-$r$ matrix with ambient dimension $d$, given $n$ features as the side prior
information. The goal is to make use of the known features to reduce sample
and computational complexities. We present and analyze a new gradient-based
non-convex optimization algorithm that converges to the true underlying matrix
at a linear rate with sample complexity depending only linearly on $n$ and
logarithmically on $d$. To the best of our knowledge, all previous
algorithms either have a quadratic dependency on the number of features in
sample complexity or a sub-linear computational convergence rate. In addition,
we provide experiments on both synthetic and real-world data to demonstrate the
effectiveness of our proposed algorithm.
Comment: 35 pages, 3 figures and 2 tables
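The following sketch sets up a toy inductive matrix completion instance and runs plain factored gradient descent from a spectral initialization. It is not the paper's multi-phase Procrustes flow (no balancing regularization or phase transitions); the dimensions, sampling rate, and step size are illustrative, and the features are taken orthonormal for simplicity.

```python
import numpy as np

rng = np.random.default_rng(1)
d1, d2, n_feat, r = 120, 120, 12, 3
A = np.linalg.qr(rng.standard_normal((d1, n_feat)))[0]   # known row-side features
B = np.linalg.qr(rng.standard_normal((d2, n_feat)))[0]   # known column-side features
Z_star = rng.standard_normal((n_feat, r)) @ rng.standard_normal((r, n_feat))
M = A @ Z_star @ B.T                                     # rank-r ground truth

p = 0.3
mask = rng.random((d1, d2)) < p                          # observed entries

# Spectral initialization: A^T (P_Omega(M)/p) B is an unbiased estimate of Z_star.
Z0 = A.T @ (np.where(mask, M, 0.0) / p) @ B
Uq, s, Vh = np.linalg.svd(Z0)
U = Uq[:, :r] * np.sqrt(s[:r])                           # balanced rank-r factors
V = Vh[:r].T * np.sqrt(s[:r])

lr = 0.005
for t in range(1000):
    R = np.where(mask, A @ U @ V.T @ B.T - M, 0.0) / p   # rescaled residual
    G = A.T @ R @ B                                      # gradient w.r.t. core Z = U V^T
    U, V = U - lr * G @ V, V - lr * G.T @ U
print(np.linalg.norm(A @ U @ V.T @ B.T - M) / np.linalg.norm(M))
```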
When is a Convolutional Filter Easy To Learn?
We analyze the convergence of the (stochastic) gradient descent algorithm for
learning a convolutional filter with Rectified Linear Unit (ReLU) activation
function. Our analysis does not rely on any specific form of the input
distribution and our proofs only use the definition of ReLU, in contrast with
previous works that are restricted to standard Gaussian input. We show that
(stochastic) gradient descent with random initialization can learn the
convolutional filter in polynomial time and the convergence rate depends on the
smoothness of the input distribution and the closeness of patches. To the best
of our knowledge, this is the first recovery guarantee of gradient-based
algorithms for a convolutional filter on non-Gaussian input distributions. Our
theory also justifies the two-stage learning rate strategy in deep neural
networks. While our focus is theoretical, we also present experiments that
illustrate our theoretical findings.
Comment: Published as a conference paper at ICLR 2018
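A minimal teacher-student sketch of the setting: one ReLU convolutional filter with average pooling, learned by SGD from a random initialization. Gaussian inputs, the stride, and the constant step size are illustrative choices here; the paper's analysis covers a broader class of input distributions and discusses initialization schemes that guarantee success.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, stride = 16, 7, 3
starts = list(range(0, d - k + 1, stride))     # patch offsets: 0, 3, 6, 9
w_star = rng.standard_normal(k)                # unknown teacher filter

def predict(w, x):
    # Average-pooled response of a single ReLU filter over the patches of x.
    return np.mean([max(0.0, w @ x[s:s + k]) for s in starts])

w = rng.standard_normal(k)                     # random initialization
lr = 0.02                                      # constant step; a decaying schedule
                                               # would drive the error to zero
for t in range(30000):
    x = rng.standard_normal(d)                 # Gaussian here, for simplicity
    err = predict(w, x) - predict(w_star, x)
    g = np.zeros(k)                            # (sub)gradient of 0.5*err^2 w.r.t. w
    for s in starts:
        patch = x[s:s + k]
        if w @ patch > 0:
            g += patch
    w -= lr * err * g / len(starts)
print("||w - w_star|| =", np.linalg.norm(w - w_star))
```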
Algorithmic Regularization in Learning Deep Homogeneous Models: Layers are Automatically Balanced
We study the implicit regularization imposed by gradient descent for learning
multi-layer homogeneous functions including feed-forward fully connected and
convolutional deep neural networks with linear, ReLU or Leaky ReLU activation.
We rigorously prove that gradient flow (i.e. gradient descent with
infinitesimal step size) effectively enforces the differences between squared
norms across different layers to remain invariant without any explicit
regularization. This result implies that if the weights are initially small,
gradient flow automatically balances the magnitudes of all layers. Using a
discretization argument, we analyze gradient descent with positive step size
for the non-convex low-rank asymmetric matrix factorization problem without any
regularization. Inspired by our findings for gradient flow, we prove that
gradient descent with step sizes $\eta_t = O(t^{-(1/2+\delta)})$ ($0 < \delta \le 1/2$) automatically balances
two low-rank factors and converges to a bounded global optimum. Furthermore,
for rank-$1$ asymmetric matrix factorization we give a finer analysis showing
gradient descent with a constant step size converges to the global minimum at a
globally linear rate. We believe that the idea of examining the invariance
imposed by first order algorithms in learning homogeneous models could serve as
a fundamental building block for studying optimization for learning deep
models.
Comment: In NIPS 2018
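The balancedness invariant is easy to observe numerically. The sketch below runs gradient descent with a small constant step size on asymmetric matrix factorization and prints $\|U\|_F^2 - \|V\|_F^2$, which stays (approximately) at its initial value while the loss decreases; the dimensions and step size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 20, 3
M = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))   # rank-r target
U = 0.01 * rng.standard_normal((d, r))                          # small init
V = 0.01 * rng.standard_normal((d, r))

eta = 1e-3                                # small step, approximating gradient flow
for t in range(5001):
    R = U @ V.T - M                       # residual of 0.5 * ||U V^T - M||_F^2
    if t % 1000 == 0:
        bal = np.linalg.norm(U) ** 2 - np.linalg.norm(V) ** 2
        print(f"t={t:5d}  loss={0.5 * np.linalg.norm(R) ** 2:10.4f}  balance={bal:.2e}")
    U, V = U - eta * R @ V, V - eta * R.T @ U
```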
Computationally Efficient Robust Estimation of Sparse Functionals
Many conventional statistical procedures are extremely sensitive to seemingly
minor deviations from modeling assumptions. This problem is exacerbated in
modern high-dimensional settings, where the problem dimension can grow with and
possibly exceed the sample size. We consider the problem of robust estimation
of sparse functionals, and provide a computationally and statistically
efficient algorithm in the high-dimensional setting. Our theory identifies a
unified set of deterministic conditions under which our algorithm guarantees
accurate recovery. By further establishing that these deterministic conditions
hold with high-probability for a wide range of statistical models, our theory
applies to many problems of considerable interest including sparse mean and
covariance estimation; sparse linear regression; and sparse generalized linear
models.
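For intuition about the problem setup (and emphatically not the paper's algorithm, which attains much stronger guarantees), here is a naive baseline for contaminated sparse mean estimation: a coordinatewise median followed by hard thresholding to the top $k$ coordinates.

```python
import numpy as np

def robust_sparse_mean(X, k):
    """Toy baseline for eps-contaminated k-sparse mean estimation:
    coordinatewise median (robust to a constant fraction of outliers),
    then keep the k largest coordinates in magnitude (sparsity)."""
    med = np.median(X, axis=0)
    est = np.zeros_like(med)
    top = np.argsort(np.abs(med))[-k:]
    est[top] = med[top]
    return est

rng = np.random.default_rng(0)
n, d, k, eps = 500, 200, 5, 0.1
mu = np.zeros(d); mu[:k] = 3.0                     # k-sparse true mean
X = mu + rng.standard_normal((n, d))               # inlier samples
X[: int(eps * n)] = 50.0                           # adversarial contamination
print(np.linalg.norm(robust_sparse_mean(X, k) - mu))
```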
On Stationary-Point Hitting Time and Ergodicity of Stochastic Gradient Langevin Dynamics
Stochastic gradient Langevin dynamics (SGLD) is a fundamental algorithm in
stochastic optimization. Recent work by Zhang et al. [2017] presents an
analysis of the hitting time of SGLD for first- and second-order stationary
points. The proof in Zhang et al. [2017] is a two-stage procedure based on
bounding the Cheeger constant, which is rather complicated and leads to loose
bounds. In this paper, using intuitions from stochastic differential equations,
we provide a direct analysis of the hitting times of SGLD to first- and
second-order stationary points. Our analysis is straightforward, relying only
on basic tools from linear algebra and probability theory. It also
leads to tighter bounds than those of Zhang et al. [2017] and shows the explicit
dependence of the hitting time on different factors, including dimensionality,
smoothness, noise strength, and step size. Under suitable conditions,
we show that the hitting time of SGLD to first-order stationary points can be
dimension-independent. Moreover, we apply our analysis to study several
important online estimation problems in machine learning, including linear
regression, matrix factorization, and online PCA.
Comment: 41 pages
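For reference, the SGLD iteration itself is one line: a stochastic gradient step plus Gaussian noise scaled by $\sqrt{2\eta/\beta}$, where $\eta$ is the step size and $\beta$ the inverse temperature. The toy objective, batch size, and temperature below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 1000
X = rng.standard_normal((n, d))                   # data for a toy objective

def grad_f(theta, batch):
    # Gradient of f(theta) = ||theta||^2 / 2 - mean_i cos(x_i . theta).
    return theta + (batch * np.sin(batch @ theta)[:, None]).mean(axis=0)

theta = rng.standard_normal(d)
eta, beta = 1e-2, 10.0                            # step size, inverse temperature
for t in range(5000):
    batch = X[rng.integers(0, n, size=32)]        # minibatch stochastic gradient
    noise = np.sqrt(2 * eta / beta) * rng.standard_normal(d)
    theta = theta - eta * grad_f(theta, batch) + noise   # the SGLD update
print("||grad f|| at final iterate:", np.linalg.norm(grad_f(theta, X)))
```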
An Improved Gap-Dependency Analysis of the Noisy Power Method
We consider the noisy power method algorithm, which has wide applications in
machine learning and statistics, especially those related to principal
component analysis (PCA) under resource (communication, memory or privacy)
constraints. Existing analysis of the noisy power method shows an
unsatisfactory dependency on the "consecutive" spectral gap
$\sigma_k - \sigma_{k+1}$ of an input data matrix, which could be very small
and hence limits the algorithm's applicability. In this paper, we present a new
analysis of the noisy power method that achieves improved gap dependency for
both sample complexity and noise tolerance bounds. More specifically, we
improve the dependency on $\sigma_k - \sigma_{k+1}$ to a dependency on
$\sigma_k - \sigma_{q+1}$, where $q$ is an intermediate algorithm parameter that
could be much larger than the target rank $k$. Our proofs are built upon a
novel characterization of proximity between two subspaces that differs from the
canonical angle characterizations analyzed in previous works. Finally, we apply
our improved bounds to distributed private PCA and memory-efficient streaming
PCA and obtain bounds that are superior to existing results in the literature.
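A minimal sketch of the noisy power method on a synthetic symmetric matrix, run with $q > k$ columns as in the improved analysis; the noise scale and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, q = 50, 3, 10                            # target rank k; iterate with q > k columns
A = rng.standard_normal((d, d))
A = A @ A.T / d                                # symmetric PSD input matrix

X = np.linalg.qr(rng.standard_normal((d, q)))[0]
for t in range(200):
    G = 1e-4 * rng.standard_normal((d, q))     # per-iteration noise (e.g. from privacy)
    X, _ = np.linalg.qr(A @ X + G)             # noisy multiply, then re-orthonormalize

w, V = np.linalg.eigh(A)
U_k = V[:, -k:]                                # true top-k eigenvectors
err = np.linalg.norm(U_k - X @ (X.T @ U_k), ord=2)
print("top-k eigenspace residual outside span(X):", err)
```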
Gradient Descent Provably Optimizes Over-parameterized Neural Networks
One of the mysteries in the success of neural networks is that randomly
initialized first-order methods like gradient descent can achieve zero training
loss even though the objective function is non-convex and non-smooth. This
paper demystifies this surprising phenomenon for two-layer fully connected ReLU
activated neural networks. For a shallow neural network with $m$ hidden nodes,
ReLU activation, and $n$ training data points, we show that as long as $m$ is large enough
and no two inputs are parallel, randomly initialized gradient descent converges
to a globally optimal solution at a linear convergence rate for the quadratic
loss function.
Our analysis relies on the following observation: over-parameterization and
random initialization jointly restrict every weight vector to be close to its
initialization for all iterations, which allows us to exploit a strong
convexity-like property to show that gradient descent converges at a global
linear rate to the global optimum. We believe these insights are also useful in
analyzing deep models and other first-order methods.
Comment: ICLR 2019
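A small numerical sketch of the regime studied: a wide two-layer ReLU network with a fixed $\pm 1$ output layer, whose first layer is trained by gradient descent on the quadratic loss from Gaussian initialization. The width, step size, and data are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 5, 2000                          # samples, input dim, hidden width
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm inputs, no two parallel
y = rng.standard_normal(n)

W = rng.standard_normal((m, d))                # first layer: Gaussian init, trained
a = rng.choice([-1.0, 1.0], size=m)            # output layer: fixed random signs

def net(W):
    # f(x_i) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x_i)
    return np.maximum(W @ X.T, 0.0).T @ a / np.sqrt(m)

eta = 0.2
for t in range(501):
    u = net(W)
    if t % 100 == 0:
        print(t, 0.5 * np.sum((u - y) ** 2))
    act = (W @ X.T > 0).astype(float)          # m x n ReLU activation pattern
    G = ((act * a[:, None]) * (u - y)) @ X / np.sqrt(m)
    W -= eta * G                               # full-batch gradient step
```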
Understanding the Acceleration Phenomenon via High-Resolution Differential Equations
Gradient-based optimization algorithms can be studied from the perspective of
limiting ordinary differential equations (ODEs). Motivated by the fact that
existing ODEs do not distinguish between two fundamentally different
algorithms---Nesterov's accelerated gradient method for strongly convex
functions (NAG-SC) and Polyak's heavy-ball method---we study an alternative
limiting process that yields high-resolution ODEs. We show that these ODEs
permit a general Lyapunov function framework for the analysis of convergence in
both continuous and discrete time. We also show that these ODEs are more
accurate surrogates for the underlying algorithms; in particular, they not only
distinguish between NAG-SC and Polyak's heavy-ball method, but they allow the
identification of a term that we refer to as "gradient correction" that is
present in NAG-SC but not in the heavy-ball method and is responsible for the
qualitative difference in convergence of the two methods. We also use the
high-resolution ODE framework to study Nesterov's accelerated gradient method
for (non-strongly) convex functions, uncovering a hitherto unknown
result---that NAG-C minimizes the squared gradient norm at an inverse cubic
rate. Finally, by modifying the high-resolution ODE of NAG-C, we obtain a
family of new optimization methods that are shown to maintain the accelerated
convergence rates of NAG-C for smooth convex functions.
Comment: 82 pages, 11 figures
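The discrete-time difference between the two methods is easy to exhibit: both take the same momentum step, but NAG-SC evaluates the gradient at the extrapolated point while heavy-ball evaluates it at the current iterate, and the gap between those two gradients is the discrete footprint of the gradient-correction term. The quadratic, momentum parameter, and step size below are illustrative.

```python
import numpy as np

H = np.diag([0.01, 1.0])                 # ill-conditioned quadratic f(x) = 0.5 x'Hx
mu, L_smooth = 0.01, 1.0
s = 1.0 / L_smooth                       # step size
beta = (1 - np.sqrt(mu * s)) / (1 + np.sqrt(mu * s))   # momentum parameter

def grad(x):
    return H @ x

def run(method, T=300):
    x = x_prev = np.array([1.0, 1.0])
    for _ in range(T):
        v = x + beta * (x - x_prev)      # momentum extrapolation (same for both)
        if method == "nag-sc":
            x_new = v - s * grad(v)      # gradient at the extrapolated point
        else:
            x_new = v - s * grad(x)      # heavy-ball: gradient at the iterate
        x, x_prev = x_new, x
    return 0.5 * x @ H @ x

print("NAG-SC    :", run("nag-sc"))
print("heavy-ball:", run("heavy-ball"))
```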